
TaskVine: Managing In-Cluster Storage for High-Throughput Data Intensive Workflows

https://doi.org/10.1145/3624062.3624277

Sly-Delgado, Barry; Phung, Thanh Son; Thomas, Colin; Simonetti, David; Hennessee, Andrew; Tovar, Ben; Thain, Douglas (November 2023, ACM)

Many scientific applications are expressed as high-throughput workflows that consist of large graphs of data assets and tasks to be executed on large parallel and distributed systems. A challenge in executing these workflows is managing data: both datasets and software must be efficiently distributed to cluster nodes; intermediate data must be conveyed between tasks; output data must be delivered to its destination. Scaling problems result when these actions are performed in an uncoordinated manner on a shared filesystem. To address this problem, we introduce TaskVine: a system for exploiting the aggregate local storage and network capacity of a large cluster. TaskVine tracks the lifetime of data in a workflow, from archival sources to final outputs, making use of local storage to distribute and re-use data wherever possible. We describe the architecture and novel capabilities of TaskVine, and demonstrate its use with applications in genomics, high energy physics, molecular dynamics, and machine learning.

